Intelligent Monitoring Framework for Cloud Services: A Data-Driven Approach
Srinivas, Pooja, Husain, Fiza, Parayil, Anjaly, Choure, Ayush, Bansal, Chetan, Rajmohan, Saravan
Cloud service owners need to continuously monitor their services to ensure high availability and reliability. Gaps in monitoring can delay incident detection and cause significant negative customer impact. The current process of monitor creation is ad hoc and reactive: developers create monitors using tribal knowledge and, primarily, a trial-and-error process. As a result, monitors often have incomplete coverage, which leads to production issues, or redundant coverage, which results in noise and wasted effort. In this work, we address this issue by proposing an intelligent monitoring framework that recommends monitors for cloud services based on their service properties. We start by mining the attributes of 30,000+ monitors from 791 production services at Microsoft and derive a structured ontology for monitors. We focus on two crucial dimensions: what to monitor (resources) and which metrics to monitor. We conduct an extensive empirical study and derive key insights on the major classes of monitors employed by cloud services at Microsoft, their associated dimensions, and the interrelationship between service properties and this ontology. Using these insights, we propose a deep-learning-based framework that recommends monitors based on service properties. Finally, we conduct a user study with engineers from Microsoft that demonstrates the usefulness of the proposed framework. The framework, together with the ontology-driven projections, produced production-quality recommendations for the majority of resource classes. This was also validated by the study participants, who rated the framework's usefulness at 4.27 out of 5.
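The abstract's ontology-driven projection idea, recommending (resource, metric) monitor pairs from service properties, can be sketched as a toy lookup. The ontology dimensions ("what to monitor" and "which metrics") come from the abstract; the concrete resource classes, metric names, and the `recommend_monitors` function below are invented for illustration and do not reflect the paper's actual model.

```python
# Hypothetical monitor ontology: resource class -> candidate metrics.
# Class and metric names are illustrative only.
MONITOR_ONTOLOGY = {
    "compute": ["cpu_utilization", "memory_usage"],
    "storage": ["disk_latency", "disk_capacity"],
    "network": ["request_rate", "error_rate", "latency"],
}

def recommend_monitors(service_properties):
    """Project a service's declared resources onto (resource, metric) pairs."""
    recommendations = []
    for resource in service_properties.get("resources", []):
        for metric in MONITOR_ONTOLOGY.get(resource, []):
            recommendations.append((resource, metric))
    return recommendations

# A service that declares compute and network resources gets five candidates.
recs = recommend_monitors({"resources": ["compute", "network"]})
```

In the paper this projection is learned by a deep model rather than hard-coded; the sketch only shows the shape of the input and output.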
X-lifecycle Learning for Cloud Incident Management using LLMs
Goel, Drishti, Husain, Fiza, Singh, Aditya, Ghosh, Supriyo, Parayil, Anjaly, Bansal, Chetan, Zhang, Xuchao, Rajmohan, Saravan
Incident management for large cloud services is a complex and tedious process that requires a significant amount of manual effort from on-call engineers (OCEs). OCEs typically leverage data from different stages of the software development lifecycle (SDLC) (e.g., code, configuration, monitor data, service properties, service dependencies, troubleshooting documents) to generate insights for detecting, root-causing, and mitigating incidents. Recent advancements in large language models (LLMs) (e.g., ChatGPT, GPT-4, Gemini) have created opportunities to automatically generate contextual recommendations that help OCEs quickly identify and mitigate critical issues. However, existing research typically takes a siloed view, solving a given incident-management task with data from a single SDLC stage. In this paper, we demonstrate that augmenting additional contextual data from different stages of the SDLC improves performance on two critically important and practically challenging tasks: (1) automatically generating root-cause recommendations for dependency-failure-related incidents, and (2) identifying the ontology of the service monitors used to automatically detect incidents. Using a dataset of 353 incidents and 260 monitors from Microsoft, we demonstrate that augmenting contextual information from different stages of the SDLC improves performance over state-of-the-art methods.
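The cross-stage augmentation idea, assembling LLM context from several SDLC artifacts instead of a single source, can be sketched as simple prompt construction. The stage names and `build_prompt` helper below are illustrative assumptions; the paper's actual prompt design is not reproduced here.

```python
def build_prompt(incident, context):
    """Assemble an LLM prompt from whichever SDLC artifacts are available.

    `context` maps an SDLC stage name to a text snippet from that stage;
    the stage list below is a hypothetical example, not the paper's.
    """
    sections = [f"Incident: {incident}"]
    for stage in ("code", "configuration", "monitors",
                  "dependencies", "troubleshooting_guides"):
        if stage in context:
            sections.append(f"[{stage}] {context[stage]}")
    sections.append("Task: recommend the most likely root cause.")
    return "\n".join(sections)

# Context drawn from two different SDLC stages for one incident.
prompt = build_prompt(
    "DB timeout in checkout service",
    {"monitors": "CPU spike at 14:02", "code": "retry loop added last week"},
)
```

The point of the sketch is only that each additional stage contributes a section to the model's context; the paper's contribution is showing that this multi-stage context outperforms single-stage baselines.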
Reinforcement Learning Agent Design and Optimization with Bandwidth Allocation Model
Reale, Rafael F., Martins, Joberto S. B.
Reinforcement learning (RL) is currently used in various real-life applications. RL-based solutions have the potential to address problems generically, including problems that are difficult to solve with heuristics and meta-heuristics as well as those that require some intelligent or cognitive approach. However, designing an RL agent is not straightforward and raises important issues, including modeling the target problem, state-space explosion, the training process, and agent efficiency. Current research addresses these issues with the aim of fostering RL adoption. A Bandwidth Allocation Model (BAM), in summary, allocates and shares resources among users; there are three basic BAM models and several hybrids that differ in how they do so. This paper addresses the design and efficiency of an RL agent whose objective is to allocate and share resources among users, and investigates how a BAM can contribute to that design and efficiency. The AllocTC-Sharing (ATCS) model is analytically described and simulated to evaluate how it mimics the RL agent's operation and how it can offload computational tasks from the agent. The essential question investigated is whether algorithms integrated with the RL agent's design and operation can facilitate agent design and optimize its execution. The ATCS analytical model and simulation demonstrate that a BAM offloads agent tasks and assists in the agent's design and optimization.
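The BAM idea, each traffic class holds a reservation and spare capacity is loaned to classes whose demand exceeds it, can be illustrated with a minimal allocation step. This is a toy sketch in the spirit of AllocTC-Sharing (any class may borrow spare capacity); the function name, class labels, and numbers are invented and do not reproduce the paper's analytical model.

```python
def allocate(link_capacity, reservations, demands):
    """One BAM-style allocation step: reserved share first, then loan spare."""
    # Step 1: each class receives min(demand, reservation).
    alloc = {c: min(demands[c], reservations[c]) for c in reservations}
    spare = link_capacity - sum(alloc.values())
    # Step 2: loan remaining capacity to classes still in deficit.
    # (Sorted order keeps the sketch deterministic; a real BAM defines
    # which classes may borrow from which.)
    for c in sorted(reservations):
        deficit = demands[c] - alloc[c]
        loan = min(deficit, spare)
        alloc[c] += loan
        spare -= loan
    return alloc

# c1 demands more than its reservation and borrows from c0's unused share.
alloc = allocate(
    100,
    {"c0": 50, "c1": 30, "c2": 20},   # reservations
    {"c0": 20, "c1": 60, "c2": 10},   # current demands
)
```

The point made in the paper is that such allocation logic, handled by the BAM itself, is work the RL agent no longer has to learn or compute.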
Direct Mappings between RDF and Property Graph Databases
Thakkar, Harsh, Angles, Renzo, Tomaszuk, Dominik, Lehmann, Jens
RDF [21] and graph databases [27] are two approaches to data management based on modeling, storing, and querying graph-like data. Database systems based on these models are gaining relevance in industry due to their use in application domains where complex data analytics is required [2]. RDF triplestores and graph database systems are closely related, as both are based on graph data models. RDF databases are based on the RDF data model [21], their standard query language is SPARQL [15], and RDF Schema [8] makes it possible to describe classes of resources and properties (i.e., the data schema). On the other hand, most graph databases are based on the Property Graph (PG) data model, for which there is no standard query language and no standard notion of property graph schema [25]. RDF and PG database systems therefore differ in data model, schema constraints, and query language.
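One common direct-mapping intuition between the two models is that triples whose object is a literal become node properties, while triples whose object is another resource become edges. The sketch below illustrates only that intuition; the IRI test (`startswith("http")`) and the `rdf_to_pg` helper are simplifying assumptions, not the schema-aware mappings the paper formalizes.

```python
def rdf_to_pg(triples):
    """Map RDF triples (s, p, o) to a toy property graph.

    Objects that look like IRIs become nodes linked by edges;
    all other objects are stored as properties of the subject node.
    """
    nodes, edges = {}, []
    for s, p, o in triples:
        nodes.setdefault(s, {})
        if isinstance(o, str) and o.startswith("http"):
            nodes.setdefault(o, {})      # object property -> edge
            edges.append((s, p, o))
        else:
            nodes[s][p] = o              # datatype property -> node property
    return nodes, edges

nodes, edges = rdf_to_pg([
    ("http://ex.org/alice", "name", "Alice"),
    ("http://ex.org/alice", "age", 30),
    ("http://ex.org/alice", "knows", "http://ex.org/bob"),
])
```

A faithful mapping also has to handle blank nodes, datatypes, and multi-valued properties, which is where the formal direct mappings studied in the paper come in.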